Improving LLM's Attachment to External Knowledge In Dialogue Generation Tasks Through Entity Anonymization

Sheikhi, Hadi, Huang, Chenyang, Zaïane, Osmar R.

arXiv.org Artificial Intelligence

Knowledge graph-based dialogue generation (KG-DG) is a challenging task requiring models to effectively incorporate external knowledge into conversational responses. While large language models (LLMs) have achieved impressive results across various NLP tasks, their ability to utilize external knowledge in KG-DG remains under-explored. We observe that LLMs often rely on internal knowledge, leading to detachment from the provided knowledge graph, even when given a flawlessly retrieved one. First, we introduce LLM-KAT, an evaluation procedure for measuring knowledge attachment in generated responses. Second, we propose a simple yet effective entity anonymization technique to encourage LLMs to better leverage external knowledge. Experiments on the OpenDialKG dataset demonstrate that our approach improves LLMs' attachment to external knowledge.
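The core idea of entity anonymization can be sketched roughly as follows: replace each distinct entity in the knowledge-graph triples with a neutral placeholder before prompting the LLM, so the model cannot fall back on its internal knowledge of the named entities, then restore the real names in the generated response. The function names and the placeholder scheme below are illustrative assumptions, not the paper's implementation:

```python
def anonymize_entities(triples):
    """Replace each distinct entity in (head, relation, tail) triples
    with a placeholder token; return anonymized triples and a reverse map."""
    mapping = {}

    def placeholder(entity):
        if entity not in mapping:
            mapping[entity] = f"[ENT{len(mapping)}]"
        return mapping[entity]

    anon = [(placeholder(h), r, placeholder(t)) for h, r, t in triples]
    # The reverse map lets us restore real names in the LLM's response.
    reverse = {v: k for k, v in mapping.items()}
    return anon, reverse


def deanonymize(text, reverse):
    """Substitute the original entity names back into generated text."""
    for ph, entity in reverse.items():
        text = text.replace(ph, entity)
    return text
```

For example, `[("Inception", "directed_by", "Christopher Nolan")]` becomes `[("[ENT0]", "directed_by", "[ENT1]")]`, and a generated sentence mentioning `[ENT1]` is mapped back to "Christopher Nolan" afterwards.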





Neural Information Processing Systems

We thank the reviewers for their valuable input on how to improve our manuscript. We use our evaluation procedure (Section 4) since we will not have ground-truth outcomes. The revision will provide this discussion with relevant citations. We would like to clarify that Theorem 3.1 describes the conditions under which our method is optimal. The RF estimation error dominates the confounding error.


ClonEval: An Open Voice Cloning Benchmark

Christop, Iwona, Kuczyński, Tomasz, Kubis, Marek

arXiv.org Artificial Intelligence

We present a new benchmark for voice cloning text-to-speech models. The benchmark consists of an evaluation protocol, an open-source library for assessing the performance of voice cloning models, and an accompanying leaderboard. The paper discusses design considerations and presents a detailed description of the evaluation procedure. The usage of the software library is explained, along with the organization of the leaderboard. The evaluation results of selected open-source models are reported.


Beyond algorithm hyperparameters: on preprocessing hyperparameters and associated pitfalls in machine learning applications

Sauer, Christina, Boulesteix, Anne-Laure, Hanßum, Luzia, Hodiamont, Farina, Bausewein, Claudia, Ullmann, Theresa

arXiv.org Machine Learning

Adequately generating and evaluating prediction models based on supervised machine learning (ML) is often challenging, especially for less experienced users in applied research areas. Special attention is required in settings where the model generation process involves hyperparameter tuning, i.e. data-driven optimization of different types of hyperparameters to improve the predictive performance of the resulting model. Discussions about tuning typically focus on the hyperparameters of the ML algorithm (e.g., the minimum number of observations in each terminal node for a tree-based algorithm). In this context, it is often neglected that hyperparameters also exist for the preprocessing steps that are applied to the data before it is provided to the algorithm (e.g., how to handle missing feature values in the data). As a consequence, users experimenting with different preprocessing options to improve model performance may be unaware that this constitutes a form of hyperparameter tuning - albeit informal and unsystematic - and thus may fail to report or account for this optimization. To illuminate this issue, this paper reviews and empirically illustrates different procedures for generating and evaluating prediction models, explicitly addressing the different ways algorithm and preprocessing hyperparameters are typically handled by applied ML users. By highlighting potential pitfalls, especially those that may lead to exaggerated performance claims, this review aims to further improve the quality of predictive modeling in ML applications.
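The pitfall the abstract describes can be made concrete with a small sketch: when the choice of imputation strategy is selected because it gives the best validation score, that choice is itself a tuned hyperparameter, and the final model must be evaluated on data not used for the selection. The toy model and function names below are illustrative assumptions, not the paper's experimental setup:

```python
import statistics


def impute(values, strategy):
    """Fill missing values (None) using the chosen strategy.
    The strategy ("mean" or "median") is a preprocessing hyperparameter."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed) if strategy == "mean" else statistics.median(observed)
    return [fill if v is None else v for v in values]


def validation_error(train, valid, strategy):
    """Toy model: predict the mean of the imputed training values,
    scored by mean absolute error on the validation set."""
    prediction = statistics.mean(impute(train, strategy))
    return statistics.mean(abs(v - prediction) for v in valid)


def tune_preprocessing(train, valid, strategies=("mean", "median")):
    # Picking the strategy that minimizes validation error IS
    # hyperparameter tuning, even if done informally by hand; the
    # resulting model therefore needs a separate, untouched test set.
    return min(strategies, key=lambda s: validation_error(train, valid, s))
```

Reporting the validation score of the winning strategy as the model's performance is exactly the kind of exaggerated claim the review warns about, since that score was optimized over.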


A Dataset for Evaluating LLM-based Evaluation Functions for Research Question Extraction Task

Fujisaki, Yuya, Takagi, Shiro, Asoh, Hideki, Kumagai, Wataru

arXiv.org Artificial Intelligence

The progress in text summarization techniques has been remarkable. However, the task of accurately extracting and summarizing the necessary information from highly specialized documents such as research papers has not been sufficiently investigated. We focus on the task of extracting research questions (RQs) from research papers and construct a new dataset consisting of machine learning papers, RQs extracted from these papers by GPT-4, and human evaluations of the extracted RQs from multiple perspectives. Using this dataset, we systematically compared recently proposed LLM-based evaluation functions for summarization and found that none of them showed sufficiently high correlations with human evaluations. We expect our dataset to provide a foundation for further research on developing better evaluation functions tailored to the RQ extraction task and to contribute to enhancing performance on the task. The dataset is available at https://github.com/auto-res/PaperRQ-HumanAnno-Dataset.
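Agreement between an LLM-based evaluation function and human ratings is typically measured with a rank correlation. As an illustration only (the abstract does not specify which correlation measure the paper uses), Spearman's correlation can be computed from scratch as the Pearson correlation of the rank vectors:

```python
def ranks(xs):
    """Average 1-based ranks of xs, with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend over a run of tied values
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)
```

A finding of "insufficiently high correlation" then corresponds to this coefficient staying well below 1 when `a` holds the automatic scores and `b` the human ratings for the same set of extracted RQs.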